Method and apparatus for extracting information from forms

ABSTRACT

A system for extracting handwritten or typed information from forms that have been printed in colors other than the color of the handwritten or typed information. The information extraction system includes a detector for detecting color values for scanned pixel locations on a printed form; a comparator for comparing the color values with reference color values; an identifier for identifying ones of the scanned pixel locations that have color values that correspond to the reference color values; and an optical character recognition engine for receiving data regarding the identified locations.

BACKGROUND OF THE INVENTION

1. Field of the Invention:

The present invention generally relates to systems for extracting information from printed forms by using optical character recognition scanners.

2. State of Art:

Although optical character recognition (OCR) scanners are well known, it is still common practice to manually extract information from printed forms. For example, information which has been written or typed onto medical forms and the like is usually extracted manually. Manual extraction of information from printed forms is time-consuming and subject to human error, but the extraction of information from forms with OCR scanners can also create errors.

FIG. 1 shows examples of situations that can cause conventional OCR scanning systems to err in extracting information from printed forms. Generally speaking, the errors occur because information that has been typed or written onto a printed form is slightly mis-positioned. For instance, the drawing shows characters 15 that have been typed onto a form at positions outside of zones defined by printed vertical lines 14. In addition, the drawing shows characters 16 that are positioned such that they descend over a printed horizontal line 14.

Characters 15 and 16 in FIG. 1 may be incorrectly extracted from a printed form by a conventional OCR scanning system because the OCR control system is confused by the placement of the characters across printed lines. More particularly, an OCR system may operate to only identify information which is printed in certain pre-defined reading zones and, therefore, may omit information which is printed or typed onto a form in transgression of its reading zones.

In the prior art, OCR scanning systems have been proposed that operate in ways to reduce the above-discussed difficulties in extracting handwritten or typed information from printed forms. For example, a workstation for extracting information from printed forms having particular colors is described in a brochure entitled "The Future Data Entry Workstation--POLYFORM--The Form Reader for Automatic Character Reading from Forms and Documents Written by Hand or Machine".

FIG. 2 shows a simplified example of one of the POLYFORM workstations. Generally speaking, the workstation includes a light source 2 which scans a beam 4 across a colored form 6. Interposed between the light source and the colored form is a wheel 8 comprised of filters, each of which has a different color. In operation of the workstation, a particular color filter is selected to match the color of the printed form, thereby allowing an OCR scanner 12 to discriminate typed or handwritten information from information printed on the form, provided that the handwritten or typed information has a different color than the printed form.

The system of FIG. 2 has several disadvantages. One disadvantage is that a different color filter must be selected whenever the color of a form is changed. Moreover, the filter must be selected manually, since the system lacks any intrinsic means of determining the required color of the filter. Another disadvantage is that the system usually cannot successfully extract information from multi-color forms. For instance, the system may not be able to successfully extract information from pink forms that have red high-lighted sections or blue sections.

A further disadvantage of the system of FIG. 2 is that the system can only operate upon forms of a limited number of colors. This limitation follows from the fact that, for practical reasons, the color wheel can comprise only a limited number of color filters. In a commercial sense, this limitation may be the most critical of all, since the system may become inoperative when there are relatively slight changes in color from one form to another due, for example, to aging by prolonged exposure to bright sunlight or to different printing runs.

SUMMARY OF THE INVENTION

Generally speaking, the present invention provides improved systems for extracting handwritten or typed information from forms that have been printed in one or more colors that are different than the color of the handwritten or typed information. In the preferred embodiment of the present invention, the information extraction system includes means for detecting two or more color values for scanned pixel locations on a printed form; means for comparing the color values with two or more reference color values; and an optical character recognition engine for receiving data derived from the comparisons.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention can be further understood from the following detailed description in conjunction with the appended drawings. In the drawings:

FIG. 1 is a drawing that provides examples of circumstances that can cause errors in the extraction of handwritten or typed information from a printed form;

FIG. 2 is a schematic diagram of a system according to the prior art;

FIG. 3 is a functional block diagram of a system according to the present invention;

FIG. 4 is a functional block diagram of a dichroic filtering system for use with the system of FIG. 3;

FIG. 5 is a schematic diagram of a circuit for use with the system of FIG. 3;

FIGS. 6 and 7 are flow charts that illustrate two algorithms for use with the system of FIG. 3; and

FIGS. 8a through 8e are diagrams that illustrate operation of the system of FIG. 3.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS OF THE INVENTION

FIG. 3 shows a system for extracting information that has been handwritten or typed in one color onto a form 28 printed in a different color. For example, the system of FIG. 3 can be employed for extracting information that has been typed in black ink onto a yellow form that has been printed with red ink.

The system in FIG. 3 generally includes a multi-chromatic light source 24 for scanning the form 28 and a color convertor 32 for detecting color values of scanned areas (i.e., pixel locations) on the form. As will be described below, usually three primary color values (e.g., red, green and blue) are detected for each pixel location. Further, the system includes a comparator means 33 for receiving signals from color convertor 32 for each of the detected colors. Comparator means 33 is also connected to a map means 35 for receiving reference color values that serve as a basis for comparison. Still further, a buffer 34 is connected for storing digital output signals from comparator means 33. The output of the buffer is connected to a conventional OCR engine 36.

The color convertor 32 in FIG. 3 can have any one of several conventional embodiments. For example, the color convertor can be as taught in U.S. Pat. No. 4,709,144 (Vincent), the entire disclosure of which is incorporated herein by reference. The color convertor 32 can, alternatively, be embodied as the dichroic filter which is shown in FIG. 4.

In the FIG. 4 embodiment, the color convertor comprises a three-facet prism, generally designated by the number 37, a lens 38 for directing collimated light onto one facet of the prism, an array of photosensitive detectors 43-45 that receive light emerging from the prism, and an array of analog-to-digital convertors 48-50 that receive signals from detectors 43-45, respectively. It should be understood that at least two of the facets of prism 37, including the facet that initially receives light from lens 38, are coated to provide dichroic filtering. Suitable coatings are well known and usually comprise thin, multi-layer interference coatings.

In practice, the photosensitive detectors 43-45 in FIG. 4 can be charge-coupled devices (CCDs). Such detectors each provide at least one photosite for sensing incident light. Output signals from such detectors normally are analog signals.

In operation of the dichroic filter of FIG. 4, collimated light from lens 38 is initially separated into two components, one of which is reflected from the first facet of the prism and one of which is transmitted through the prism. At the first facet, the color separation, or "filtering", action is due to optical interference at the thin film layer on the facet. The same filtering action occurs as light strikes other facets of the prism.

In the embodiment of the dichroic filtering system shown in FIG. 4, the first facet of prism 37 reflects red light and transmits green and blue light. The second facet of the prism reflects blue light while transmitting green light. Accordingly, photosensitive detector 43 receives red light, detector 44 receives blue light, and detector 45 receives green light.

Further in operation of the dichroic filtering system of FIG. 4, analog signals from detectors 43, 44, and 45 are converted to digital values by the analog-to-digital convertors 48, 49, and 50, respectively. Thus, the signals from the convertors represent, in digital terms, the primary color values of the multi-chromatic light which is incident on the first facet of dichroic prism 37. In the illustrated embodiment, the output of analog-to-digital convertor 48 represents red values of the incident light, the output of convertor 49 represents blue values, and the output of convertor 50 represents green values. The term "tri-stimulus" will be used in the following to describe those red, green and blue values collectively.

In typical practice, each primary color value is digitally represented with six to eight binary bits and, accordingly, eighteen to twenty-four bits are available to describe the tri-stimulus color content of each pixel location. With such resolution, a wide variety of inks and colors of forms can be employed in the system of FIG. 3. Accordingly, the system is not limited to use with forms of particular colors or, indeed, to single color forms.
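The bit budget described above can be illustrated with a minimal sketch that packs three primary color values into a single integer. The function names and the channel order (red in the high bits, blue in the low bits) are assumptions made for illustration only; the specification does not prescribe a storage layout.

```python
def pack_tristimulus(red, green, blue, bits=8):
    """Pack three primary color values into one integer of 3 * bits bits."""
    limit = (1 << bits) - 1
    for value in (red, green, blue):
        if not 0 <= value <= limit:
            raise ValueError(f"color value {value} does not fit in {bits} bits")
    # Assumed layout: red in the high bits, then green, then blue.
    return (red << (2 * bits)) | (green << bits) | blue


def unpack_tristimulus(packed, bits=8):
    """Recover the (red, green, blue) values from a packed integer."""
    mask = (1 << bits) - 1
    return (packed >> (2 * bits)) & mask, (packed >> bits) & mask, packed & mask
```

With six bits per channel the packed value occupies at most eighteen bits, and with eight bits per channel at most twenty-four, which matches the range stated above.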

Additional circuitry, not shown in FIG. 4, can be provided for normalizing and calibrating output signals from convertors 48-50. The normalization circuitry can, for example, compensate for non-linear optical detectors. Also, multiple optical detectors can be provided for each color channel; for instance, detector 43 could comprise an array of separate detectors.

FIG. 5 shows comparator means 33 embodied as a circuit including, among other elements, a parallel array of six comparators 61-66 which are paired in three sets. One input to each of the six comparators is a reference value provided from a threshold color map 68. For example, comparator 61 receives a "high blue" reference signal from map 68 and its paired comparator 62 receives a "low blue" reference signal from the map. Similarly, paired comparators 63 and 64 receive high and low green reference signals, respectively, from the map. Also, paired comparators 65 and 66 receive high and low red reference signals, respectively. In practice, the reference values provided by map 68 are either set to the color values of the colored form or to the color values of the information to be extracted from the form; usually, the former setting is more convenient.

Also in the circuit of FIG. 5, the three pairs of comparators receive the above-discussed tri-stimulus signals. Specifically, the paired comparators 61 and 62 receive detected blue signal values, the paired comparators 63 and 64 receive detected green signal values, and the paired comparators 65 and 66 receive detected red signal values.

Further in the circuit of FIG. 5, output lines from the sets of paired comparators are connected to AND gates 81, 83, and 85, respectively. Specifically, output lines from paired comparators 61 and 62 are connected to AND gate 81, output lines from paired comparators 63 and 64 are connected to AND gate 83, and output lines from paired comparators 65 and 66 are connected to AND gate 85. Finally, output lines from each of the three AND gates are connected to an AND gate 87.

Generally speaking, the circuit of FIG. 5 operates such that the output of AND gate 87 is a single binary bit (i.e., a binary "0" or "1") which indicates whether scanned information should be extracted from a printed form.

In one specific mode of operation of the circuit of FIG. 5, one of the comparators in each of the three pairs produces a "HI" output (i.e., a binary 1) if a detected primary color value for a given pixel location is greater than, or equal to, the low threshold value for that primary color and, otherwise, will produce a "LO" output signal (i.e., a binary 0). For example, comparator 66 would provide a binary HI output if the detected red value at a given pixel location were greater than, or equal to, the low red reference value provided by color map 68. Otherwise, comparator 66 would provide a LO output. Similarly, comparator 64 would produce a HI output if the detected green value for the given pixel location were greater than, or equal to, the low green reference provided by map 68 and, otherwise, would produce a LO output. And, comparator 62 would operate in a manner similar to comparators 64 and 66.

Further as to the example provided in the preceding paragraph, the second comparator in each of the three pairs will produce a HI output if the detected color value for the given pixel location is less than, or equal to, the high threshold value for the given color and, otherwise, would produce a LO output. For example, comparator 65 will produce a HI output if the detected red value for the given pixel location is less than, or equal to, the high reference red color value. The comparators 63 and 61 will operate similarly.

Thus, with respect to operation of the circuit of FIG. 5 according to the specific example provided above, output signals from both comparators in any one of the pairs will be HI only if the detected color value for a given pixel location is between the low and high threshold values for the color which is referenced to that pair of comparators. For instance, comparators 63 and 64 will both provide HI outputs only if the green value of a detected pixel location is between the high threshold green value provided to comparator 63 and the low threshold green value provided to comparator 64.

Because AND gates 81, 83, 85 and 87 in the circuit of FIG. 5 will each provide HI outputs only if all of their inputs are HI, a HI output from the final AND gate 87 indicates that the detected primary color values (i.e., the detected red, green and blue values) for a particular pixel location are all between the preselected upper and lower threshold limits. On the other hand, a LO output from AND gate 87 indicates that at least one of the red, green or blue color values at a scanned location lies outside the range defined by its threshold values. In practice, an inverter (not shown) can be added to the output of AND gate 87 to reverse the binary output signal as desired.
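The mode of operation just described can be modeled in software as three window comparisons combined by a logical AND. The following is only a sketch: the names `Window`, `in_window`, `gate_87` and `YELLOW_FORM`, and the numeric reference values, are hypothetical and chosen for illustration; the comparator and gate numbers in the comments follow FIG. 5.

```python
from typing import NamedTuple


class Window(NamedTuple):
    low: int   # "low" reference value from the threshold color map
    high: int  # "high" reference value from the threshold color map


def in_window(value, window):
    """Model one comparator pair plus its AND gate (81, 83 or 85)."""
    at_or_above_low = value >= window.low    # comparators 62, 64, 66
    at_or_below_high = value <= window.high  # comparators 61, 63, 65
    return at_or_above_low and at_or_below_high


def gate_87(red, green, blue, color_map):
    """Return 1 when all three channels fall inside their windows, else 0."""
    return int(
        in_window(blue, color_map["blue"])
        and in_window(green, color_map["green"])
        and in_window(red, color_map["red"])
    )


# Hypothetical reference values for a pale yellow form background.
YELLOW_FORM = {
    "red": Window(200, 255),
    "green": Window(200, 255),
    "blue": Window(80, 160),
}
```

With the map set to the form color, as in this sketch, a result of 1 marks a pixel that belongs to the form and a result of 0 marks a pixel to be extracted; inverting the result corresponds to the optional inverter mentioned above.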

With the preceding example in mind, it can be understood that, for appropriately selected threshold values, AND gate 87 in the circuit of FIG. 5 can be made to output binary signals whose values indicate whether information at particular locations should be extracted from a printed form. In other words, the binary output signals from AND gate 87 indicate whether handwritten or typed information on a printed form is to be extracted from the form (provided that the color of the form and the printing thereon is substantially different from the color of the handwritten or typed information). Specifically, according to the preceding example, a binary "1" output from AND gate 87 would indicate that information at a detected pixel location corresponds to the printed form while, on the other hand, a binary "0" output for the same location would indicate that the information is to be extracted from the form. The ranges between the threshold values allow the system of FIG. 3 to accommodate minor color variations in the form without creating information extraction errors.

Stated somewhat differently, the circuit of FIG. 5 can be described as operating to code pixel locations as white when the locations do not have color values that correspond to the approximate color of information to be extracted from a printed form and as black when the locations have color values that correspond to the approximate color of information to be extracted.

At this juncture, it should be emphasized that the preceding explanation provides only one example of a mode in which the circuit of FIG. 5 could be operated. Those skilled in digital logic design will understand that the circuit can be operated in different modes depending upon the reference values and comparators selected. Those alternative modes of operation may provide advantages not found in the mode of operation described in the preceding example.

Operation of the complete system of FIG. 3 can now be understood. Initially, it should be assumed that a printed form 28 having a particular color is placed into a position for scanning by light source 24. It should also be initially assumed that appropriate reference values have been stored in the threshold color map 68 of FIG. 5. Then, as each location on the form is scanned, light is reflected onto color convertor 32 and the convertor functions to separately detect the red, green and blue color values for the scanned location. Output signals from the convertor then are provided to comparator means 33. The comparator means operates, as described above, to provide binary data that indicates whether the scanned information is to be extracted from the form. After a given pixel location has been identified as containing information for extraction, OCR engine 36 can employ the binary data (via buffer 34) to identify text and graphics which have been handwritten or typed onto the form.

Further as to operation of the system of FIG. 3, it can be noted that handwritten or typed information extracted from a printed form can be stored, as in a data base 90, for later two-color printing (e.g., black and white printing). Also, the information stored in the data base 90 can be reproduced and otherwise manipulated independently of the form from which the information was extracted.

In an alternative embodiment of the system of FIG. 3, the tri-stimulus color values for each pixel location can be converted to gray scale signals. Then, for pixel locations which contain information which is to be extracted from a form, the gray scale values can be passed to OCR engine 36 in place of the binary outputs of comparator means 33. (The gray scale conversion techniques are well known, and standard algorithms for converting tri-stimulus color representations to luminance (monochrome) and chrominance representations are provided in many handbooks.) The gray scale values can then be stored and/or printed to provide text and graphics in shades of gray.
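As one illustration of such a conversion, the sketch below computes a gray value from 8-bit red, green and blue values using the luma weights of ITU-R Recommendation BT.601 (0.299, 0.587 and 0.114). The specification does not name a particular formula, so the choice of these weights, the 8-bit range and the function name `to_gray` are assumptions.

```python
def to_gray(red, green, blue):
    """Convert 8-bit tri-stimulus values to an 8-bit gray value.

    Uses BT.601 luma weights in integer arithmetic; adding 500 before the
    division by 1000 rounds the result to the nearest integer.
    """
    return (299 * red + 587 * green + 114 * blue + 500) // 1000
```

Green contributes most to the result and blue least, so a saturated blue form color maps to a dark gray while a yellow one maps to a light gray.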

The buffer 34 in the system of FIG. 3 can serve two primary functions. First, the buffer can collect sufficient scan lines to allow OCR engine 36 to perform the character recognition function. Secondly, the buffer can allow the system to accommodate different pixel processing rates by its various components. (Typically, the OCR engine is the slowest component.) In practice, the buffer is usually sized to store the pixel contents of an entire page; alternatively, it may be interfaced to a scan controller (not shown) to temporarily halt line scanning until the OCR engine has removed the current contents of the buffer.
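The second arrangement, in which scanning pauses while the buffer is full, might be modeled as a bounded queue of scan lines. This is a minimal sketch only; the class name `ScanLineBuffer` and its methods are hypothetical, and a real scan controller interface is not described in the specification.

```python
from collections import deque


class ScanLineBuffer:
    """Bounded FIFO of scan lines between the comparator and the OCR engine."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of scan lines held
        self._lines = deque()

    def is_full(self):
        return len(self._lines) >= self.capacity

    def push(self, line):
        """Store one scan line; return False if the scanner should pause."""
        if self.is_full():
            return False
        self._lines.append(line)
        return True

    def drain(self):
        """Hand every buffered line to the OCR engine and empty the buffer."""
        lines = list(self._lines)
        self._lines.clear()
        return lines
```

A caller would stop requesting scan lines when `push` returns False and resume after `drain` has been called on behalf of the OCR engine.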

To further automate the system of FIG. 3, means can be provided to allow the system to automatically identify a form and its color scheme so that threshold values appropriate for the form can be automatically loaded into the threshold color map for the system. Such means can include, for example, means for identifying a form identity code that is printed on forms in a standard location (e.g., in the upper left corner of each form). A suitable code would be one, such as a bar code, consisting of characters that can be recognized by the OCR engine and sequenced to have a standardized, unambiguous meaning. With such a standardized practice, a scanner can be instructed to preliminarily scan each form for the identity code.

In practice, the above-discussed form identity code may include information about a form in addition to its colors. For example, the code may indicate the locations at which data will be found on a form. Such codes could increase the speed of a system such as in FIG. 3 since a scanner could be instructed to skip predetermined sections of forms.
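Loading thresholds from a recognized form identity code amounts to a table lookup. The sketch below assumes a hypothetical table, `FORM_THRESHOLDS`, keyed by made-up form codes, with a (low, high) pair per color channel; none of these codes or values come from the specification.

```python
# Hypothetical form identity codes mapped to (low, high) windows per channel.
FORM_THRESHOLDS = {
    "FORM-A": {"red": (200, 255), "green": (200, 255), "blue": (80, 160)},
    "FORM-B": {"red": (0, 90), "green": (0, 90), "blue": (170, 255)},
}


def load_threshold_map(form_code, table=FORM_THRESHOLDS):
    """Return a copy of the threshold windows registered for a form code."""
    try:
        entry = table[form_code]
    except KeyError:
        raise LookupError(
            f"no thresholds registered for form code {form_code!r}"
        ) from None
    # Copy so that callers cannot alter the registered values by accident.
    return dict(entry)
```

An unrecognized code raises an error rather than falling back to default thresholds, on the assumption that extracting with the wrong color scheme is worse than stopping.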

FIGS. 6 and 7 show logical flow charts that summarize steps in processes for extracting information from multi-colored forms. Generally speaking, the flow charts depict processes wherein color values are compared to target color values at each scanned pixel location. The target color values are equivalent to the above-discussed threshold values.

In the flow chart of FIG. 6, a pixel location is assigned a binary value of 0 if the detected color value at the location equals or exceeds the target color value; otherwise, the location is assigned a binary value of 1. In the flow chart of FIG. 7, if any one of the primary color values detected at a particular pixel location equals or exceeds a corresponding target value, then all of the detected color values for that location are converted to their maximum values (e.g., binary 1). However, if any one of the detected color values is less than the corresponding target value for that location, then all of the detected color values for that location are converted to their minimum values (e.g., binary 0).
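The single-value rule stated for FIG. 6 can be written as a short sketch. Only the FIG. 6 rule is shown; the function names `fig6_bit` and `binarize_row` are hypothetical, and the flow chart itself may contain steps that are not reflected here.

```python
def fig6_bit(detected, target):
    """Return 0 if the detected value equals or exceeds the target, else 1."""
    return 0 if detected >= target else 1


def binarize_row(detected_values, target):
    """Apply the FIG. 6 rule to every pixel location in one scan line."""
    return [fig6_bit(value, target) for value in detected_values]
```

For example, with a target value of 128, the detected values 10, 128 and 200 are assigned 1, 0 and 0, respectively.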

Operation of the system of FIG. 3 can be further understood by reference to FIGS. 8a through 8e. In FIG. 8a, the sixteen squares comprising the larger square can each be understood to represent a pixel location. In turn, the sixteen pixel locations can be understood to represent a portion of one of the characters 15 in FIG. 1 which has transgressed a printed prompt line 14. (It will be remembered that the prompt line is printed in a color different than the character color.) For purposes of this example, it can be assumed that the character is typed or written in black ink and that any printing on the form is blue.

During a scan of the pixel locations shown on FIG. 8a, the red and green components of light which strike the blue colored pixel locations are absorbed. Therefore, the blue colored pixel locations reflect little or no red and green light to color convertor 32 in the system of FIG. 3. Accordingly, for the blue colored pixel locations, the threshold values for the red and green colors will not be reached and the binary outputs on the red and green channels will be LO. The binary output on the blue channel will, in contrast, be HI.

The pixel locations in FIG. 8a which are approximately black also will absorb the red and green components of the source light. Accordingly, the binary output of comparator means 33 will remain "LO" on the red and green channels for the black colored pixel locations. Moreover, those locations will absorb the blue components of the incident light. Therefore, the blue channel output will be LO for the black colored locations.

FIG. 8e shows the converted image for the preceding example. It will be noted that the result of the conversion is to distinguish the black characters from the blue form.
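The example of FIGS. 8a through 8e can be traced with a small sketch. The color readings, the single threshold of 128 and the layout of the sixteen pixel locations are all assumptions made for illustration; the actual arrangement in the drawings is not reproduced here.

```python
# Hypothetical 8-bit (red, green, blue) readings for the two inks.
BLUE = (20, 30, 220)   # printed prompt line
BLACK = (15, 15, 15)   # typed character

# Assumed per-channel threshold; a channel reads HI at or above it.
THRESHOLD = 128


def channel_bits(pixel):
    """Return (red, green, blue) bits: 1 for HI, 0 for LO."""
    return tuple(int(value >= THRESHOLD) for value in pixel)


def is_character(pixel):
    """A pixel belongs to the black character when all channels are LO."""
    return channel_bits(pixel) == (0, 0, 0)


# Assumed 4 x 4 patch: a black stroke running through blue printing.
patch = [
    [BLUE, BLACK, BLACK, BLUE],
    [BLUE, BLACK, BLACK, BLUE],
    [BLUE, BLUE, BLACK, BLUE],
    [BLUE, BLUE, BLACK, BLUE],
]

# 1 marks a location to be extracted (black), 0 a location of the blue form.
converted = [[int(is_character(pixel)) for pixel in row] for row in patch]
```

Only the blue channel differs between the two inks in this sketch, which is why the blue channel output is what separates the character from the prompt line.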

In view of the preceding discussion, it can be understood that the system of FIG. 3 can be adapted to extract information of any specified color (e.g., blue) from any other colors (e.g., red) on a form. Accordingly, the system is operative to extract information from multi-color forms. Further, the system can be adapted to extract information of various colors from a form.

It will be appreciated by those of ordinary skill in the art that the present invention can be embodied in still other specific forms without departing from the spirit or essential characteristics thereof. For example, it will be apparent to workers skilled in the art that the circuit of FIG. 5 and many other of the components of the system of FIG. 3 can be implemented in hardware, software, or a combination of the two.

In view of the above-discussed alternatives, and others, it should be understood that all of the disclosed embodiments are to be considered to be illustrative and not restrictive. The scope of the present invention is defined by the appended claims rather than the foregoing description, and all changes that come within the meaning and range of equivalents thereof are intended to be embraced by the present invention.

What is claimed is:
 1. A system for extracting handwritten and typed information of a predetermined color from a form printed in a different color, comprising: scanner means for scanning pixel locations on a printed form; detector means for detecting at least two color values for each scanned pixel location; comparator means for comparing the detected color values with reference color values; identifier means for identifying and distinguishing ones of the pixel locations on the printed form from adjacent pixel locations by comparison of their color values to the reference color values, which identifier means provides a binary output for each scanned pixel location such that one state of the binary output represents correspondence between the detected color values of the scanned pixel location and at least one of the reference color values and, thereby, indicates whether handwritten and typed information on the printed form is to be extracted from the form when the color of the printing on the form is substantially different from the color of the handwritten and typed information at adjacent locations on the printed form; and an optical character recognition engine for receiving data regarding identified ones of the pixel locations.
 2. A system according to claim 1, wherein scanned pixels having detected color components that equal or exceed said reference color components represent information which is to be extracted from said form.
 3. A system according to claim 1, wherein scanned pixels having detected color components that are less than or equal to said reference color components represent information which is to be extracted from said form.
 4. A system according to claim 1 further including means to identify form identity codes that have been printed on forms in a standard location.
 5. A system according to claim 1 further including a threshold color map and means that automatically identify a form and its color scheme such that threshold values appropriate for the form can be automatically loaded into the threshold color map.
 6. A system for extracting handwritten or typed information from forms that have been printed in one or more colors that are different than the color of the handwritten or typed information, comprising: scanner means for scanning selected pixel locations on a printed form; color convertor means comprising a dichroic filter means for detecting at least two color values for each scanned pixel location; comparator means for comparing the detected color values with reference color values; identifier means for providing a binary output for each scanned pixel location which distinguishes certain ones of the pixel locations on the printed form from adjacent pixel locations on the printed form, with one state of the binary output representing correspondence between the detected color values of the scanned pixel location and at least one of the reference color values to, thereby, identify whether handwritten and typed information should be extracted from the form by comparison of color values at the scanned pixel locations with the reference color values; an optical character recognition engine for receiving data regarding identified ones of the pixel locations; and threshold color map means that automatically identify a printed form and its color scheme.
 7. A system according to claim 6 wherein scanned pixels having detected color components that equal or exceed said reference color components represent information which is to be extracted from said form.